[SOUND]
So,
looking at the text mining problem more
closely, we see that the problem is
similar to general data mining, except
that we'll be focusing more on text data.
And we're going to have text mining
algorithms to help us to turn text data
into actionable knowledge that
we can use in the real world,
especially for decision making, or
for completing whatever tasks
require support from text data.
In general,
in many real-world data mining problems
we also tend to have other kinds
of data that are non-textual.
So a more general picture would be
to include non-text data as well.
And for this reason we might be
concerned with joint mining of text and
non-text data.
And so in this course we're
going to focus more on text mining,
but we're also going to touch on how to
do joint analysis of both text data and
non-text data.
With this problem definition we
can now look at the landscape of
the topics in text mining and analytics.
Now this slide shows the process of
generating text data in more detail.
More specifically, a human sensor or
human observer would look at
the world from some perspective.
Different people would be looking at
the world from different angles and
they'll pay attention to different things.
The same person at different times might
also pay attention to different aspects
of the observed world.
And so the humans are able to perceive
the world from some perspective.
And that human, the sensor,
would then form a view of the world.
And that can be called the Observed World.
Of course, this would be different from
the Real World because the perspective
that the person has taken
can often be biased.
Now the Observed World can be
represented as, for example,
entity-relation graphs or
in a more general way,
using a knowledge representation language.
But in general, this is basically what
a person has in mind about the world.
And we don't really know what
exactly it looks like, of course.
But then the human would
express what the person has
observed using a natural language,
such as English.
And the result is text data.
Of course a person could have used
a different language to express what he or
she has observed.
In that case we might have text data of
mixed languages or different languages.
The main goal of text mining
is actually to reverse this
process of generating text data.
We hope to be able to uncover
some aspect in this process.
Specifically, we can think about mining,
for example, knowledge about the language.
And that means by looking at text data
in English, we may be able to discover
something about English, some usage
of English, some patterns of English.
So this is one type of mining problem,
where the result is
some knowledge about language which
may be useful in various ways.
If you look at the picture,
we can also then mine knowledge
about the observed world.
And so this has much to do with
mining the content of text data.
We're going to look at what the text
data are about, and then try to
get the essence of it or
extract high-quality information
about a particular aspect of
the world that we're interested in.
For example, everything that has been
said about a particular person or
a particular entity.
And this can be regarded as mining content
to describe the observed world in
the user's mind or the person's mind.
If you look further,
then you can also imagine
we can mine knowledge about this observer,
himself or herself.
So this also has to do with
using text data to infer
some properties of this person.
And these properties could
include the mood of the person or
sentiment of the person.
And note that we distinguish
the observed world from the person
because text data can describe what the
person has observed in an objective way.
But the description can also be
subjective, with sentiment, and so,
in general, you can imagine the text
data would contain some factual
descriptions of the world plus
some subjective comments.
So that's why it's also possible to
do text mining to mine
knowledge about the observer.
Finally, if you look at
the left side of this picture,
then you can see we can certainly also
say something about the real world.
Right?
So indeed we can do text mining to
infer other real world variables.
And this is often called
predictive analytics.
And we want to predict the value
of certain interesting variables.
So, this picture basically covers
multiple types of knowledge that
we can mine from text in general.
When we infer other
real world variables we
could also use some of the results from
mining text data as intermediate
results to help the prediction.
For example,
after we mine the content of text data we
might generate some summary of content.
And that summary could be then used
to help us predict the variables
of the real world.
Now of course this is still generated
from the original text data,
but I want to emphasize here that
often the processing of text data
to generate some features that can help
with the prediction is very important.
And that's why here we show that the results of
some other mining tasks, including
mining the content of text data and
mining knowledge about the observer,
can all be very helpful for prediction.
In fact, when we have non-text data,
we could also use the non-text
data to help prediction, and
of course it depends on the problem.
In general, non-text data can be very
important for such prediction tasks.
For example,
if you want to predict stock prices or
changes of stock prices based on
discussion in the news articles or
in social media, then this is an example
of using text data to predict
some other real world variables.
But in this case, obviously,
the historical stock price data would
be very important for this prediction.
And so that's an example of
non-text data that would be very
useful for the prediction.
And we're going to combine both kinds
of data to make the prediction.
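As a rough illustration of combining the two kinds of data, here is a minimal Python sketch. It is not a method from this course: the cue-word list, the function names, and the toy inputs are all made-up assumptions. It only shows the idea of placing text-derived features and a non-text feature (recent price changes) side by side in one feature vector that a predictor could then use.

```python
from collections import Counter

# Hypothetical cue words; a real system would learn its vocabulary from data.
CUE_WORDS = ["profit", "growth", "loss", "lawsuit"]

def text_features(article: str) -> list[float]:
    """Count how often each cue word appears in the article (bag of words)."""
    counts = Counter(article.lower().split())
    return [float(counts[w]) for w in CUE_WORDS]

def combined_features(article: str, past_changes: list[float]) -> list[float]:
    """Concatenate the text features with one non-text feature:
    the mean of the recent daily price changes (0.0 if none are given)."""
    mean_change = sum(past_changes) / len(past_changes) if past_changes else 0.0
    return text_features(article) + [mean_change]

# Toy, made-up inputs for illustration only.
features = combined_features(
    "Strong profit and growth reported despite a pending lawsuit",
    [0.5, -0.25, 0.5],
)
print(features)
```

The combined vector could then be passed to whatever prediction model fits the problem. Which text features and which non-text features actually help depends on the task.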
Now non-text data can also be used for
analyzing text by supplying context.
When we look at the text data alone,
we'll be mostly looking at the content
and/or opinions expressed in the text.
But text data generally also
has context associated.
For example, the time and the location
that are associated with the text data.
And this is useful context information.
And the context can provide interesting
angles for analyzing text data.
For example, we might partition text
data into different time periods
because time information is available.
Now we can analyze text data in each
time period and then make a comparison.
Similarly we can partition text
data based on locations or
any metadata that's associated with it
to form interesting comparisons.
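The partitioning idea can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions: the documents, the "year" metadata field, and the function name are hypothetical, and simple word counts stand in for whatever analysis one would actually run on each partition.

```python
from collections import Counter, defaultdict

# Toy documents with a hypothetical "year" metadata field.
docs = [
    {"year": 2012, "text": "battery life is poor"},
    {"year": 2012, "text": "poor screen and poor battery"},
    {"year": 2013, "text": "battery life is great"},
]

def word_counts_by_context(documents, key):
    """Group documents by a metadata field and count the words in each group."""
    groups = defaultdict(Counter)
    for doc in documents:
        groups[doc[key]].update(doc["text"].lower().split())
    return groups

by_year = word_counts_by_context(docs, "year")
for year in sorted(by_year):
    # Show the two most frequent words in each time period for comparison.
    print(year, by_year[year].most_common(2))
```

The same function works for any other metadata field, such as a location, by passing a different key.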
So, in this sense,
non-text data can actually provide
interesting angles or
perspectives for text data analysis.
And it can help us make context-sensitive
analysis of content or
the language usage or
the opinions about the observer or
the authors of text data.
We could analyze the sentiment
in different contexts.
So this is a fairly general landscape of
the topics in text mining and analytics.
In this course we're going to
selectively cover some of those topics.
We actually hope to cover
most of these general topics.
First we're going to cover
natural language processing very
briefly because this has to do
with understanding text data and
this determines how we can represent
text data for text mining.
Second, we're going to talk about how to
mine word associations from text data.
And word associations are a form of useful
lexical knowledge about a language.
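To give a feel for what a word association might look like in practice, here is a minimal Python sketch that counts how often two words occur in the same sentence. This is just one simple stand-in, not necessarily the technique covered later in the course, and the sentences and function name are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy sentences, made up for illustration.
sentences = [
    "the cat eats fish",
    "the dog eats meat",
    "the cat drinks milk",
]

def cooccurrence_counts(sents):
    """Count, for each unordered word pair, the number of sentences
    in which both words appear. Pairs are stored in alphabetical order."""
    pairs = Counter()
    for sent in sents:
        words = sorted(set(sent.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs

pairs = cooccurrence_counts(sentences)
print(pairs[("cat", "the")])
print(pairs[("cat", "eats")])
```

Raw co-occurrence counts favor very frequent words such as "the", so a practical measure would usually need some normalization on top of counts like these.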
Third, we're going to talk about
topic mining and analysis.
And this is only one way to
analyze content of text, but
it's a very useful way
of analyzing content.
It's also one of the most useful
techniques in text mining.
Then we're going to talk about
opinion mining and sentiment analysis.
So this can be regarded as one example
of mining knowledge about the observer.
And finally we're going to
cover text-based prediction
problems where we try to predict some
real world variable based on text data.
So this slide also serves as
a road map for this course.
And we're going to use
this as an outline for
the topics that we'll cover
in the rest of this course.
[MUSIC]

